Diabetes Prediction Capstone Project¶
Machine Learning Workflow¶
- Define the Problem
- Gather Data
- Exploratory Data Analysis (EDA)
- Preprocess Data
- Choose a Model
- Train the Model
- Evaluate the Model
# Import all libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
# Load the dataset
df = pd.read_csv(r"C:\Users\austi\Downloads\diabetes_prediction_dataset.csv")
df.head()
| gender | age | hypertension | heart_disease | smoking_history | bmi | HbA1c_level | blood_glucose_level | diabetes | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Female | 80.0 | 0 | 1 | never | 25.19 | 6.6 | 140 | 0 |
| 1 | Female | 54.0 | 0 | 0 | No Info | 27.32 | 6.6 | 80 | 0 |
| 2 | Male | 28.0 | 0 | 0 | never | 27.32 | 5.7 | 158 | 0 |
| 3 | Female | 36.0 | 0 | 0 | current | 23.45 | 5.0 | 155 | 0 |
| 4 | Male | 76.0 | 1 | 1 | current | 20.14 | 4.8 | 155 | 0 |
df.tail(5)
| gender | age | hypertension | heart_disease | smoking_history | bmi | HbA1c_level | blood_glucose_level | diabetes | |
|---|---|---|---|---|---|---|---|---|---|
| 99995 | Female | 80.0 | 0 | 0 | No Info | 27.32 | 6.2 | 90 | 0 |
| 99996 | Female | 2.0 | 0 | 0 | No Info | 17.37 | 6.5 | 100 | 0 |
| 99997 | Male | 66.0 | 0 | 0 | former | 27.83 | 5.7 | 155 | 0 |
| 99998 | Female | 24.0 | 0 | 0 | never | 35.42 | 4.0 | 100 | 0 |
| 99999 | Female | 57.0 | 0 | 0 | current | 22.43 | 6.6 | 90 | 0 |
Understanding the Data Structure¶
# Data Overview
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 100000 entries, 0 to 99999 Data columns (total 9 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 gender 100000 non-null object 1 age 100000 non-null float64 2 hypertension 100000 non-null int64 3 heart_disease 100000 non-null int64 4 smoking_history 100000 non-null object 5 bmi 100000 non-null float64 6 HbA1c_level 100000 non-null float64 7 blood_glucose_level 100000 non-null int64 8 diabetes 100000 non-null int64 dtypes: float64(3), int64(4), object(2) memory usage: 6.9+ MB
NOTE: The dataset consists of 100,000 rows and 9 columns. The columns are divided into 7 numerical and 2 categorical. There are no missing values.
# Check for duplicates
duplicates = df[df.duplicated]
print("Duplicate rows:")
print(duplicates)
# Check for missing values
df.isnull().sum()
Duplicate rows:
gender age hypertension heart_disease smoking_history bmi \
2756 Male 80.0 0 0 No Info 27.32
3272 Female 80.0 0 0 No Info 27.32
3418 Female 19.0 0 0 No Info 27.32
3939 Female 78.0 1 0 former 27.32
3960 Male 47.0 0 0 No Info 27.32
... ... ... ... ... ... ...
99980 Female 52.0 0 0 never 27.32
99985 Male 25.0 0 0 No Info 27.32
99989 Female 26.0 0 0 No Info 27.32
99990 Male 39.0 0 0 No Info 27.32
99995 Female 80.0 0 0 No Info 27.32
HbA1c_level blood_glucose_level diabetes
2756 6.6 159 0
3272 3.5 80 0
3418 6.5 100 0
3939 3.5 130 0
3960 6.0 200 0
... ... ... ...
99980 6.1 145 0
99985 5.8 145 0
99989 5.0 158 0
99990 6.1 100 0
99995 6.2 90 0
[3854 rows x 9 columns]
gender 0 age 0 hypertension 0 heart_disease 0 smoking_history 0 bmi 0 HbA1c_level 0 blood_glucose_level 0 diabetes 0 dtype: int64
# Remove duplicates rows, keeping the first occurance
df_no_duplicates = df.drop_duplicates()
print("DataFrame without duplicates:")
print(df_no_duplicates)
DataFrame without duplicates:
gender age hypertension heart_disease smoking_history bmi \
0 Female 80.0 0 1 never 25.19
1 Female 54.0 0 0 No Info 27.32
2 Male 28.0 0 0 never 27.32
3 Female 36.0 0 0 current 23.45
4 Male 76.0 1 1 current 20.14
... ... ... ... ... ... ...
99994 Female 36.0 0 0 No Info 24.60
99996 Female 2.0 0 0 No Info 17.37
99997 Male 66.0 0 0 former 27.83
99998 Female 24.0 0 0 never 35.42
99999 Female 57.0 0 0 current 22.43
HbA1c_level blood_glucose_level diabetes
0 6.6 140 0
1 6.6 80 0
2 5.7 158 0
3 5.0 155 0
4 4.8 155 0
... ... ... ...
99994 4.8 145 0
99996 6.5 100 0
99997 5.7 155 0
99998 4.0 100 0
99999 6.6 90 0
[96146 rows x 9 columns]
NOTE: The dataset contains NO null values
df.columns
Index(['gender', 'age', 'hypertension', 'heart_disease', 'smoking_history',
'bmi', 'HbA1c_level', 'blood_glucose_level', 'diabetes'],
dtype='object')
Initial Observations from the Dataset¶
The dataset contains 100,000 rows and 9 columns. Here are the key details:
Columns:
gender: Categorical (e.g., Male, Female)
age: Numerical
hypertension: Binary (0: No, 1: Yes)
heart_disease: Binary (0: No, 1: Yes)
smoking_history: Categorical (e.g., never, current)
bmi: Numerical (Body Mass Index)
HbA1c_level: Numerical (Hemoglobin A1c level)
blood_glucose_level: Numerical
diabetes: Target variable (Binary: 0: No, 1: Yes)
No Missing Values: All columns have complete data (non-null counts match the total row count).
Diabetes As Column of Interest¶
# To look at a specific column of interest - 'diabetes'
# Display basic statistics for the 'diabetes' column
print("Basic Statistics:")
print(df['diabetes'].describe())
# Display unique values in the 'diabetes' column
print("\nUnique Values:")
print(df['diabetes'].unique())
# Display counts of each unique value in the 'diabetes' column
print("\nValue Counts:")
print(df['diabetes'].value_counts())
Basic Statistics: count 100000.000000 mean 0.085000 std 0.278883 min 0.000000 25% 0.000000 50% 0.000000 75% 0.000000 max 1.000000 Name: diabetes, dtype: float64 Unique Values: [0 1] Value Counts: diabetes 0 91500 1 8500 Name: count, dtype: int64
NOTE: The Column of interest - Diabetes basic statistics is shown above. The Value Count is (Non-diabetic) 0 - 91,500; (Diabetic) - 1 is 8,500 patients.
Exploratory Data Analysis (EDA)¶
Descriptive Statistics¶
# Describe the data
df.describe()
| age | hypertension | heart_disease | bmi | HbA1c_level | blood_glucose_level | diabetes | |
|---|---|---|---|---|---|---|---|
| count | 100000.000000 | 100000.00000 | 100000.000000 | 100000.000000 | 100000.000000 | 100000.000000 | 100000.000000 |
| mean | 41.885856 | 0.07485 | 0.039420 | 27.320767 | 5.527507 | 138.058060 | 0.085000 |
| std | 22.516840 | 0.26315 | 0.194593 | 6.636783 | 1.070672 | 40.708136 | 0.278883 |
| min | 0.080000 | 0.00000 | 0.000000 | 10.010000 | 3.500000 | 80.000000 | 0.000000 |
| 25% | 24.000000 | 0.00000 | 0.000000 | 23.630000 | 4.800000 | 100.000000 | 0.000000 |
| 50% | 43.000000 | 0.00000 | 0.000000 | 27.320000 | 5.800000 | 140.000000 | 0.000000 |
| 75% | 60.000000 | 0.00000 | 0.000000 | 29.580000 | 6.200000 | 159.000000 | 0.000000 |
| max | 80.000000 | 1.00000 | 1.000000 | 95.690000 | 9.000000 | 300.000000 | 1.000000 |
Data visualization¶
# Plotting a histogram from?
import matplotlib.pyplot as plt
# Plotting a histogram for Age
plt.figure(figsize=(10, 6))
plt.hist(df['age'].dropna(), bins=30, edgecolor='black')
plt.title('Distribution of Patients Ages')
plt.xlabel('Age')
plt.ylabel('Number of Patients')
plt.show()
Age Distribution: Patients age is not normally distributed. There is skewness to the right indicating an older population. This is particularly relevant because age can be a significant predictor for diabetes.
Range and Central Tendency: The age range is between 0 to 80 and central tendency is within 42 years of ages. Diabetes risk often increases with age, so knowing the age spread can guide whether the population is high-risk.
Outliers: There are no unusual spikes or gaps, as they might indicate data collection errors or specific population characteristics. For example, a gap between age groups could suggest missing demographic data or sampling bias
Prediction Model Implications: Understanding age distribution is crucial because it allows the model to account for the non-linear relationship between age and diabetes risk. Feature engineering might benefit from categorizing age into bins/groups, such as 20-30, 31-40, etc., to improve the model's sensitivity to age-related trends.
# Bar Plot of hypertension distribution
import matplotlib.pyplot as plt
# Count each gender category
gender_counts = df['hypertension'].value_counts()
# Plotting the bar chart with adjusted y-axis limits
plt.figure(figsize=(8, 6))
gender_counts.plot(kind='bar', color=['skyblue', 'salmon', 'gray'])
plt.title("Distribution of Hypertension")
plt.xlabel("hypertension")
plt.ylabel("No of Paitents")
plt.ylim(0, max(gender_counts) + 20) # Adjust y-axis limit for better visibility of 'Other'
plt.xticks(rotation=0)
plt.show()
import pandas as pd
import matplotlib.pyplot as plt
# Group by gender and calculate the count of hypertension cases
hypertension_distribution = df.groupby('gender')['hypertension'].value_counts().unstack()
# Display the distribution
print(hypertension_distribution)
# Plotting the distribution as a bar chart
hypertension_distribution.plot(kind='bar', figsize=(10, 6), stacked=True, color=['skyblue', 'salmon'])
plt.title('Distribution of Hypertension by Gender')
plt.xlabel('Gender')
plt.ylabel('Number of Patients')
plt.legend(title='Hypertension', labels=['No (0)', 'Yes (1)'])
plt.xticks(rotation=0)
plt.show()
hypertension 0 1 gender Female 54355.0 4197.0 Male 38142.0 3288.0 Other 18.0 NaN
NOTE:
Hypertension Prevalence: The chart does not show a higher count of hypertension cases, indicating no significant prevalence within the population. Since hypertension is a known comorbidity with diabetes, a high prevalence can reinforce the importance of this feature in the prediction model.
Gender or Category Distribution: The chart highlights gender-based differences in hypertension prevalence. The ratio of female : male gender is about 4.2% : 3.3%. Thus higher counts in female gender categories suggests a gender-based risk factor, which may be valuable for model accuracy.
Feature Engineering: Given that hypertension is a comorbid condition with diabetes, I used a binary feature in the model (e.g., 1 for hypertension, 0 for no hypertension). Also I explored interaction terms between hypertension and other relevant features like age, BMI, or lifestyle factors for improved model insights.
Public Health Insight: High hypertension counts suggests the need for public health initiatives targeting hypertension management, as controlling it can improve diabetes outcomes.
import matplotlib.pyplot as plt
# Count each gender category
gender_counts = df['gender'].value_counts()
# Plotting the bar chart with adjusted y-axis limits
plt.figure(figsize=(8, 6))
gender_counts.plot(kind='bar', color=['skyblue', 'salmon', 'gray'])
plt.title("Gender Distribution")
plt.xlabel("Gender")
plt.ylabel("No of Paitents")
plt.ylim(0, max(gender_counts) + 20) # Adjust y-axis limit for better visibility of 'Other'
plt.xticks(rotation=0)
plt.show()
Gender Distribution: This visualization provides insight into the gender balance within the dataset. There is no significant imbalance (e.g., more males than females or vice versa), that could influence the model's generalizability. Diabetes risk factors and presentation can vary between genders, so a balanced distribution or proper handling of any imbalance is essential.
Model Performance and Bias: No one gender is underrepresented, therefore, the model's performance for any gender will not be affected.
# Scatter plot with diabetes
fig = px.scatter(df, x='age', y='bmi', color='diabetes', title='Age vs BMI (Color-coded by diabetes)',
labels={'diabetes': 'Diabetes', 'age': 'Patient Age', 'bmi': 'Body Mass Index'})
fig.show()
Age-BMI Relationship: This plot examined any correlation between age and BMI among patients, both with and without diabetes. Typically, higher BMI is associated with a higher risk of diabetes, and this relationship may strengthen with age. There are no clusters of high-BMI individuals at older ages with diabetes indicators, which would align with known risk patterns.
Pattern Analysis for Prediction: If individuals with diabetes did not cluster at higher BMI and age ranges, as this could signify a risk threshold.
Color-Coded Insights: By using color to differentiate diabetes status, I can visually assess how well-separated the groups are based on age and BMI. A clear separation as seen above indicates that age and BMI are strong predictive features. If separation is weak, it may suggest that additional features (e.g., lifestyle or genetic factors) are needed for more accurate predictions.
1.0 Univariate Analysis¶
## Univariate Analysis - Boxplots for Numerical Columns
columns_to_check = ['age', 'hypertension', 'heart_disease', 'bmi', 'HbA1c_level', 'blood_glucose_level', 'diabetes']
plt.figure(figsize=(18, 10))
for i, col in enumerate(columns_to_check, 1):
plt.subplot(3, 3, i)
sns.boxplot(df[col])
plt.xlabel(col)
plt.ylabel("Frequency")
plt.tight_layout()
plt.show()
Treatment of Outliers¶
#Dropping the outliers by setting a cap on bmi, blood_glucose_level and HbA1c level.
df2 = df[(df["bmi"]<90)]
df2 = df[(df["blood_glucose_level"]<300)]
df2 = df[(df["HbA1c_level"]<8)]
print(f"The total number of data-points after removing the outliers are: {len(df2)}")
The total number of data-points after removing the outliers are: 98024
- After dropping the outliers on bmi, blood_glucose_level and HbA1c level.
## Univariate Analysis - Boxplots for Numerical Columns
columns_to_check = ['age', 'hypertension', 'heart_disease', 'bmi', 'HbA1c_level', 'blood_glucose_level', 'diabetes']
plt.figure(figsize=(18, 10))
for i, col in enumerate(columns_to_check, 1):
plt.subplot(3, 3, i)
sns.boxplot(df2[col])
plt.xlabel(col)
plt.ylabel("Frequency")
plt.tight_layout()
plt.show()
- No more outliers
# Setting up the subplot grid
fig, axes = plt.subplots(1, 3, figsize=(18, 6))
# bmi distribution
axes[0].hist(df2['bmi'].dropna(), bins=30, edgecolor='black')
axes[0].set_title('Distribution of Patients BMIs')
axes[0].set_xlabel('bmi')
axes[0].set_ylabel('Number of Patients')
# gender distribution
axes[1].hist(df2['gender'], bins=30, edgecolor='black', color='orange')
axes[1].set_title('Distribution of Gender')
axes[1].set_xlabel('Gender')
axes[1].set_ylabel('Number of Patients')
# heart_disease distribution
axes[2].bar(df2['heart_disease'].value_counts().index, df['heart_disease'].value_counts(), color='green')
axes[2].set_title('Distribution of heart_disease')
axes[2].set_xlabel('Patients Class')
axes[2].set_ylabel('Number of Patients')
plt.tight_layout()
plt.show()
# Setting up the subplot grid
fig, axes = plt.subplots(1, 3, figsize=(18, 6))
# HbA1c_level distribution
axes[0].barh(df2['HbA1c_level'].value_counts().index, df2['HbA1c_level'].value_counts(), color=['skyblue', 'pink'])
axes[0].set_title('Distribution of HbA1c_level')
axes[0].set_xlabel('HbA1c_level')
axes[0].set_ylabel('Number of Patients')
# blood_glucose_level distribution
axes[1].bar(df2['blood_glucose_level'].value_counts().index, df2['blood_glucose_level'].value_counts(), color='purple')
axes[1].set_title('Distribution of blood_glucose_level')
axes[1].set_xlabel('blood_glucose_level')
axes[1].set_ylabel('Number of Patients')
# diabetes distribution
axes[2].bar(df2['diabetes'].value_counts().index, df2['diabetes'].value_counts(), color=['red', 'green'])
axes[2].set_title('Distribution of diabetes')
axes[2].set_xlabel('diabetes (0 = No, 1 = Yes)')
axes[2].set_ylabel('Number of Patients')
plt.show()
Categorical Features Analysis¶
import matplotlib.pyplot as plt
# Setting up the subplot grid
fig, axes = plt.subplots(1, 2, figsize=(18, 6))
# Gender distribution
axes[0].bar(df2['gender'].value_counts().index, df2['gender'].value_counts(), color=['blue', 'pink'])
axes[0].set_title('Distribution of Gender')
axes[0].set_xlabel('Gender')
axes[0].set_ylabel('Number of Passengers')
# Count each gender category with the exact values
gender_counts = df2['gender'].value_counts().reindex(['Male', 'Female', 'Other'], fill_value=0)
# smoking_history distribution
axes[1].hist(df2['smoking_history'], bins=30, edgecolor='black', color='orange')
axes[1].set_title('Distribution of smoking_history')
axes[1].set_xlabel('smoking_history')
axes[1].set_ylabel('Number of Patients')
plt.show()
Key Observations from Univariate Analysis¶
Age: The dataset covers a wide age range, with a higher concentration of individuals in middle age.
BMI (Body Mass Index): Most values are centered around the typical BMI range (20-30), indicating a mix of healthy and overweight individuals.
HbA1c Level: Distribution shows a concentration around normal levels (4-6), with a few higher outliers.
Blood Glucose Level: There’s a noticeable spread, with some individuals having significantly high glucose levels.
2. 0 Bivariate Analysis¶
import seaborn as sns
import matplotlib.pyplot as plt
# Bivariate analysis: Relationship between numerical features and the target
key_features = ['age', 'bmi', 'HbA1c_level', 'blood_glucose_level']
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
for feature, ax in zip(key_features, axes.ravel()):
sns.boxplot(x='diabetes', y=feature, data=df2, ax=ax)
ax.set_title(f'{feature} vs Diabetes')
ax.set_xlabel('Diabetes (0: No, 1: Yes)')
ax.set_ylabel(feature)
plt.tight_layout()
plt.show()
General Insights¶
Strong Predictors:
- HbA1c Level and Blood Glucose Level show clear separations between diabetic and non-diabetic individuals, making them key predictors for modeling diabetes risk.
Moderate Predictors:
- Age demonstrates a clear trend of increased risk in older populations, while BMI shows a weaker but notable association.
Potential Outliers:
- Extreme outliers in BMI and Blood Glucose Level for diabetic individuals might indicate specific subgroups requiring further investigation.
import matplotlib.pyplot as plt
# Setting up the subplot grid
fig, axes = plt.subplots(1, 2, figsize=(18, 6))
# Count each gender category with the exact values
gender_counts = df2['gender'].value_counts().reindex(['Male', 'Female', 'Other'], fill_value=0)
# Plotting the bar chart
# gender_counts.plot(kind='bar', color=['skyblue', 'salmon', 'gray'])
plt.title("Gender Distribution")
plt.xlabel("Gender")
plt.ylabel("Count")
plt.xticks(rotation=0)
# diabetes distribution
axes[0].bar(df2['diabetes'].value_counts().index, df2['diabetes'].value_counts(), color=['red', 'green'])
axes[0].set_title('Distribution of diabetes')
axes[0].set_xlabel('diabetes (0 = No, 1 = Yes)')
axes[0].set_ylabel('Number of Patients')
# Count each gender category with the exact values
gender_counts = df2['gender'].value_counts().reindex(['Male', 'Female', 'Other'], fill_value=0)
# Plotting the bar chart
#gender_counts.plot(kind='bar', color=['skyblue', 'salmon', 'gray'])
plt.title("Gender Distribution")
plt.xlabel("Gender")
plt.ylabel("Count")
plt.xticks(rotation=0)
# diabetes distribution
axes[1].hist(df2['bmi'], bins=30, edgecolor='black', color='red')
axes[1].set_title('Distribution of BMI')
axes[1].set_xlabel('bmi')
axes[1].set_ylabel('Number of Patients')
plt.show()
1. Distribution of Diabetes (Left Plot)
Observation:
- The majority of patients are non-diabetic (diabetes = 0), represented by the large red bar.
- Only a small fraction of patients are diabetic (diabetes = 1), represented by the smaller green bar.
Insight:
- The dataset is highly imbalanced, with a much larger proportion of non-diabetic patients compared to diabetic patients.
- This imbalance could affect model performance, especially recall for the minority class (diabetes = 1), and should be addressed through techniques like oversampling (e.g., SMOTE), undersampling, or class weighting.
2. Distribution of BMI (Right Plot)
Observation:
- The distribution of BMI is right-skewed, with most patients having BMI values clustered between 20 and 30.
- A small number of patients exhibit very high BMI values (greater than 40), which are potential outliers.
Insight:
- BMI is a critical health indicator, and the skewness suggests that while most individuals are in a healthy-to-moderately high BMI range, the outliers could represent individuals at higher risk for conditions like diabetes.
- Addressing the outliers (e.g., Winsorization or transformation) could help stabilize the model and reduce the impact of extreme values.
# diabetes rate by gender
plt.figure(figsize=(8, 5))
df2.groupby('gender')['diabetes'].mean().plot(kind='bar', color=['gold', 'silver', 'black'])
plt.title('Diabetes Rate by Patient Gender')
plt.ylabel('Diabetes Rate')
plt.show()
# diabetes rate by hypertension
plt.figure(figsize=(8, 5))
df2.groupby('smoking_history')['diabetes'].mean().plot(kind='bar', color=['gold', 'silver', 'black', 'skyblue', 'green', 'orange'])
plt.title('Diabetes Rate by Smoking_history')
plt.ylabel('smoking_history')
plt.show()
Observations¶
Highest Diabetes Rate:
- Former Smokers exhibit the highest diabetes rate (~14%). This indicates that individuals with a history of smoking are at significantly higher risk for diabetes compared to other groups.
Moderate Diabetes Rates:
- Current Smokers, individuals who have smoked "ever", and those classified as "not current" all show moderate diabetes rates (~8%-10%), which are higher than non-smokers but lower than former smokers.
Lower Diabetes Rates:
- Individuals with "No Info" and those who "Never Smoked" show the lowest diabetes rates (~2%-5%). Non-smokers generally have the lowest risk, reinforcing the link between smoking and diabetes.
Insights¶
Smoking History Is a Significant Factor:
- Smoking, particularly a past history of smoking (former smokers), appears to be strongly associated with an increased risk of diabetes.
- This could be due to long-term physiological impacts of smoking on metabolism, insulin resistance, or inflammation, which persist even after cessation.
Current Smokers at Risk:
- The diabetes rate among current smokers highlights the ongoing risk posed by active smoking. Smoking cessation could reduce this risk over time, but residual effects may still persist.
Individuals with "No Info":
- This group has the lowest diabetes rate. It might include healthier individuals or younger populations who haven’t reported their smoking status.
import plotly.express as px
# Scatter plot with diabetes
fig = px.scatter(df2, x='age', y='hypertension', color='diabetes', title='Age vs Hypertension (Color-coded by diabetes)',
labels={'diabetes': 'Diabetes', 'age': 'Patient Age', 'hypertension': 'Hypertension'})
fig.show()
Key Observations from Bivariate Analysis¶
Age vs Diabetes: Older individuals appear to have a higher likelihood of being diabetic, as shown by a higher median age in the diabetes-positive group.
BMI vs Diabetes: The BMI distribution for diabetic individuals shows a slight shift toward higher values compared to non-diabetic individuals.
HbA1c Level vs Diabetes: There’s a noticeable difference, with diabetic individuals generally having significantly higher HbA1c levels.
Blood Glucose Level vs Diabetes: Blood glucose levels are distinctly higher in the diabetic group, showing this feature’s strong potential as a predictor.
3.0 Multivariate Analysis¶
Multivariate Analysis - Correlation Heatmap¶
# Multivariate Analysis - Correlation Heatmap
correlation_matrix = df2[columns_to_check].corr()
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm", fmt='.2f')
plt.title('Correlation Matrix Heatmap')
plt.show()
#correlation Map (Heatmap) ---
c = df2.drop(["gender","smoking_history"],axis=1)
mask = np.triu(np.ones_like(c.corr(), dtype=bool))
fig, ax = plt.subplots(figsize=(10, 8))
sns.heatmap(c.corr(), mask=mask, annot=True, cmap='inferno', linewidths=0.1, cbar=True, annot_kws={"size":5})
yticks, ylabels = plt.yticks()
xticks, xlabels = plt.xticks()
ax.set_xticklabels(xlabels, size=6, fontfamily='serif')
ax.set_yticklabels(ylabels, size=6, fontfamily='serif')
plt.suptitle('Correlation Map of Numerical Variables', fontweight='heavy', x=0.327, y=0.96, ha='left', fontsize=13, fontfamily='serif')
plt.tight_layout(rect=[0, 0.04, 1, 1.01])
plt.gcf().text(0.85, 0.05, 'kaggle.com/caesarmario', style='italic', fontsize=5);
plt.show();
Observations from Correlation Analysis¶
-Diabetes Correlations:
HbA1c Level: Shows the strongest positive correlation with diabetes.
Blood Glucose Level: Also has a strong positive correlation with diabetes.
Age: Positively correlated, indicating that age plays a role in diabetes risk.
- BMI: Weak correlation, but it may still be a contributing factor.
Feature Intercorrelations:
There is a moderate positive correlation between HbA1c level and blood glucose level, suggesting some overlap in information.
import seaborn as sns
# Diabetes rate by Hypertension and Gender
plt.figure(figsize=(10, 6))
sns.barplot(x='hypertension', y='diabetes', hue='gender', data=df2, palette='viridis')
plt.title('Diabetes Rate and Hypertension by Gender')
plt.ylabel('Diabetes Rate')
plt.show()
Observations¶
Hypertension (0 = No, 1 = Yes):
Non-Hypertensive Individuals (Hypertension = 0):
Both genders have low diabetes rates, with males showing slightly higher rates than females.
Hypertensive Individuals (Hypertension = 1):
The diabetes rate increases significantly for all genders, indicating a strong association between hypertension and diabetes.
Males have a slightly higher diabetes rate compared to females among hypertensive individuals.
Gender Comparison:
- Across both hypertensive and non-hypertensive groups, males exhibit higher diabetes rates than females.
- The "Other" gender category also shows higher diabetes rates when hypertensive, but sample size might influence this observation.
Insights¶
Strong Link Between Hypertension and Diabetes:
- Hypertension appears to be a significant risk factor for diabetes, with hypertensive individuals showing dramatically higher diabetes rates across all genders.
- This aligns with clinical findings that hypertension and diabetes often co-occur due to shared metabolic and cardiovascular risk factors.
Gender Differences:
- Males consistently show higher diabetes rates compared to females, suggesting possible gender-specific risk factors or lifestyle differences influencing diabetes risk.
Non-Hypertensive Group:
- The lower diabetes rates in non-hypertensive individuals suggest that maintaining normal blood pressure might help reduce diabetes risk, irrespective of gender.
# Analyze categorical features: Gender and Smoking History
categorical_features = ['gender', 'smoking_history']
# Plot categorical features vs diabetes
fig, axes = plt.subplots(1, 2, figsize=(12, 6))
for feature, ax in zip(categorical_features, axes):
sns.countplot(x=feature, hue='diabetes', data=df2, ax=ax)
ax.set_title(f'{feature} Distribution by Diabetes')
ax.set_xlabel(feature)
ax.set_ylabel('Count')
ax.legend(title='Diabetes', loc='upper right')
plt.tight_layout()
plt.show()
Observations from Categorical Analysis¶
1. Gender vs Diabetes:¶
- The distribution of diabetes cases seems fairly balanced between males and females, suggesting gender alone may not be a strong predictor.
2. Smoking History vs Diabetes:¶
- Individuals with a "current" smoking history show a higher proportion of diabetes cases compared to "never" or "No Info" groups.
- Smoking habits might contribute to diabetes risk but will likely need further exploration in conjunction with other features.
# Pairplot of numerical features
sns.pairplot(df2[['age', 'hypertension', 'heart_disease', 'diabetes']], hue='hypertension', palette=['skyblue', 'orange'])
plt.suptitle('Pairplot of Numerical Features', y=1.02)
plt.show()
Key Observations from Multivariate Analysis¶
Age and Diabetes: Diabetic individuals tend to cluster toward higher age ranges.
HbA1c Level and Blood Glucose Level: A strong relationship is evident, with higher values correlating with diabetes presence.
BMI: While it shows some overlap, there are slight tendencies for higher BMIs in the diabetic group.
This visualization helps highlight key relationships and interactions among features and the target variable (diabetes).
Data Preparation for Machine Learning¶
# Import liberaries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
Drop Default Column from Dataset¶
# Drop the default column for rediction from the dataset - Default Column
X = df2.drop(columns= ['diabetes'])
y = df2['diabetes']
X.head()
| gender | age | hypertension | heart_disease | smoking_history | bmi | HbA1c_level | blood_glucose_level | |
|---|---|---|---|---|---|---|---|---|
| 0 | Female | 80.0 | 0 | 1 | never | 25.19 | 6.6 | 140 |
| 1 | Female | 54.0 | 0 | 0 | No Info | 27.32 | 6.6 | 80 |
| 2 | Male | 28.0 | 0 | 0 | never | 27.32 | 5.7 | 158 |
| 3 | Female | 36.0 | 0 | 0 | current | 23.45 | 5.0 | 155 |
| 4 | Male | 76.0 | 1 | 1 | current | 20.14 | 4.8 | 155 |
y.head()
0 0 1 0 2 0 3 0 4 0 Name: diabetes, dtype: int64
Import the Train & Test Split¶
from sklearn.model_selection import train_test_split
# Define the feature (inputs) and the target (output
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2, random_state=42) # test_size = 20% or 0.2 while 80% is used for training
# Check the 80% data for training the model
X_train.head()
| gender | age | hypertension | heart_disease | smoking_history | bmi | HbA1c_level | blood_glucose_level | |
|---|---|---|---|---|---|---|---|---|
| 34720 | Male | 8.0 | 0 | 0 | No Info | 16.06 | 3.5 | 155 |
| 17637 | Male | 39.0 | 0 | 0 | never | 27.56 | 6.6 | 126 |
| 48762 | Male | 46.0 | 0 | 0 | No Info | 24.28 | 4.8 | 200 |
| 12777 | Male | 53.0 | 0 | 0 | never | 23.29 | 6.1 | 155 |
| 72961 | Male | 36.0 | 0 | 0 | never | 31.29 | 5.0 | 155 |
# Check the data size
X.shape
(98024, 8)
# Check the data size for training and testing
X_train.shape, X_test.shape, y_train.shape, y_test.shape
((78419, 8), (19605, 8), (78419,), (19605,))
Encode Data Using OneHotEncoder¶
# To encode the categorical columns of the dataset using OneHotEncoder
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
categorical_col = df2.select_dtypes(exclude='number').columns
categorical_col
Index(['gender', 'smoking_history'], dtype='object')
X_train_encoded = encoder.fit_transform(X_train[categorical_col])
X_train_encoded
array([[0., 1., 0., ..., 0., 0., 0.],
[0., 1., 0., ..., 0., 1., 0.],
[0., 1., 0., ..., 0., 0., 0.],
...,
[0., 1., 0., ..., 0., 0., 1.],
[1., 0., 0., ..., 0., 1., 0.],
[1., 0., 0., ..., 0., 1., 0.]])
#X_train_encoded
encoder.get_feature_names_out(categorical_col)
array(['gender_Female', 'gender_Male', 'gender_Other',
'smoking_history_No Info', 'smoking_history_current',
'smoking_history_ever', 'smoking_history_former',
'smoking_history_never', 'smoking_history_not current'],
dtype=object)
X_train_encoded_df2 = pd.DataFrame(X_train_encoded, columns=encoder.get_feature_names_out(categorical_col))
X_train_encoded_df2.head()
| gender_Female | gender_Male | gender_Other | smoking_history_No Info | smoking_history_current | smoking_history_ever | smoking_history_former | smoking_history_never | smoking_history_not current | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 2 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 4 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
X_test_encoded = encoder.transform(X_test[categorical_col])
X_test_encoded
array([[0., 1., 0., ..., 0., 0., 0.],
[1., 0., 0., ..., 0., 0., 1.],
[1., 0., 0., ..., 0., 1., 0.],
...,
[0., 1., 0., ..., 0., 0., 0.],
[1., 0., 0., ..., 0., 0., 0.],
[0., 1., 0., ..., 0., 1., 0.]])
X_test_encoded_df2 = pd.DataFrame(X_test_encoded, columns=encoder.get_feature_names_out(categorical_col))
X_test_encoded_df2.head()
| gender_Female | gender_Male | gender_Other | smoking_history_No Info | smoking_history_current | smoking_history_ever | smoking_history_former | smoking_history_never | smoking_history_not current | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 1 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 2 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 4 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
Reset both the Train and Test Datasets¶
X_train_encoded_df2.reset_index(drop=True, inplace=True)
X_test_encoded_df2.reset_index(drop=True, inplace=True)
X_train_encoded_df2.head()
| gender_Female | gender_Male | gender_Other | smoking_history_No Info | smoking_history_current | smoking_history_ever | smoking_history_former | smoking_history_never | smoking_history_not current | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 2 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 4 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
X_train_rem = X_train.drop(columns=categorical_col).reset_index(drop=True)
X_test_rem = X_test.drop(columns=categorical_col).reset_index(drop=True)
X_train_rem.head()
| age | hypertension | heart_disease | bmi | HbA1c_level | blood_glucose_level | |
|---|---|---|---|---|---|---|
| 0 | 8.0 | 0 | 0 | 16.06 | 3.5 | 155 |
| 1 | 39.0 | 0 | 0 | 27.56 | 6.6 | 126 |
| 2 | 46.0 | 0 | 0 | 24.28 | 4.8 | 200 |
| 3 | 53.0 | 0 | 0 | 23.29 | 6.1 | 155 |
| 4 | 36.0 | 0 | 0 | 31.29 | 5.0 | 155 |
Concate the Train and Test Datasets¶
X_train = pd.concat([X_train_rem, X_train_encoded_df2], axis=1)
X_test = pd.concat([X_test_rem, X_test_encoded_df2], axis=1)
X_train.head()
| age | hypertension | heart_disease | bmi | HbA1c_level | blood_glucose_level | gender_Female | gender_Male | gender_Other | smoking_history_No Info | smoking_history_current | smoking_history_ever | smoking_history_former | smoking_history_never | smoking_history_not current | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.0 | 0 | 0 | 16.06 | 3.5 | 155 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 39.0 | 0 | 0 | 27.56 | 6.6 | 126 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 2 | 46.0 | 0 | 0 | 24.28 | 4.8 | 200 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 53.0 | 0 | 0 | 23.29 | 6.1 | 155 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 4 | 36.0 | 0 | 0 | 31.29 | 5.0 | 155 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
X_test.head()
| age | hypertension | heart_disease | bmi | HbA1c_level | blood_glucose_level | gender_Female | gender_Male | gender_Other | smoking_history_No Info | smoking_history_current | smoking_history_ever | smoking_history_former | smoking_history_never | smoking_history_not current | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 64.0 | 1 | 0 | 27.32 | 5.0 | 90 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 1 | 44.0 | 0 | 0 | 36.16 | 4.0 | 160 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 2 | 53.0 | 0 | 0 | 29.16 | 6.5 | 200 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 80.0 | 1 | 1 | 25.78 | 6.6 | 100 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 4 | 80.0 | 1 | 0 | 23.27 | 6.0 | 126 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
Scaling the Dataset with Standard Scaler¶
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.fit_transform(X_test)
X_train
array([[-1.4892923 , -0.27723391, -0.19599664, ..., -0.31822431,
-0.7341151 , -0.26191053],
[-0.11063885, -0.27723391, -0.19599664, ..., -0.31822431,
1.36218422, -0.26191053],
[ 0.20066999, -0.27723391, -0.19599664, ..., -0.31822431,
-0.7341151 , -0.26191053],
...,
[-1.00009269, -0.27723391, -0.19599664, ..., -0.31822431,
-0.7341151 , 3.81809776],
[ 0.28961537, -0.27723391, -0.19599664, ..., -0.31822431,
1.36218422, -0.26191053],
[ 1.40143267, -0.27723391, -0.19599664, ..., -0.31822431,
1.36218422, -0.26191053]])
X_test
array([[ 0.99950423, 3.59499731, -0.19847259, ..., -0.31678636,
-0.73381995, -0.26006924],
[ 0.10924395, -0.27816432, -0.19847259, ..., -0.31678636,
-0.73381995, 3.84512993],
[ 0.50986108, -0.27816432, -0.19847259, ..., -0.31678636,
1.3627321 , -0.26006924],
...,
[-0.29137318, -0.27816432, -0.19847259, ..., -0.31678636,
-0.73381995, -0.26006924],
[ 1.35560834, -0.27816432, -0.19847259, ..., -0.31678636,
-0.73381995, -0.26006924],
[ 0.28729601, -0.27816432, -0.19847259, ..., -0.31678636,
1.3627321 , -0.26006924]])
1. Testing with Logistic Regression¶
from sklearn.linear_model import LogisticRegression
log_model = LogisticRegression(random_state=42, max_iter=1000)
log_model.fit(X_train, y_train)
LogisticRegression(max_iter=1000, random_state=42)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
LogisticRegression(max_iter=1000, random_state=42)
y_pred_log = log_model.predict(X_test)
y_pred_log
array([0, 0, 0, ..., 0, 1, 0], dtype=int64)
y_test
33902 0
58831 0
50739 0
22151 0
39194 0
..
14720 0
13731 0
15916 0
76974 1
81426 0
Name: diabetes, Length: 19605, dtype: int64
from sklearn.metrics import accuracy_score,classification_report, confusion_matrix
accuracy_log = accuracy_score(y_test, y_pred_log)
print(round(accuracy_log, 2))
0.96
print(classification_report(y_test, y_pred_log))
precision recall f1-score support
0 0.97 0.99 0.98 18318
1 0.81 0.50 0.62 1287
accuracy 0.96 19605
macro avg 0.89 0.75 0.80 19605
weighted avg 0.96 0.96 0.96 19605
import seaborn as sns
sns.countplot(x='diabetes', data=df2)
<Axes: xlabel='diabetes', ylabel='count'>
conf_log = confusion_matrix(y_test, y_pred_log)
sns.heatmap(conf_log, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual');
Building the Model¶
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import cross_val_score
import numpy as np
# X_train, y_train as training data and labels
warnings.filterwarnings("ignore")
Initialize the Models to Use¶
# Initialize Models
models = {
"Logistic Regression": LogisticRegression(random_state=42),
"Decision Tree": DecisionTreeClassifier(random_state=42),
"Random Forest": RandomForestClassifier(random_state=42),
"Gradient Boosting": GradientBoostingClassifier(random_state=42),
"Support Vector Classifier": SVC(random_state=42),
}
2. Testing with Random Forest Classifier¶
from sklearn.ensemble import RandomForestClassifier
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, y_train)
RandomForestClassifier(random_state=42)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
RandomForestClassifier(random_state=42)
y_pred_rf = rf_model.predict(X_test)
y_pred_rf
array([0, 0, 0, ..., 0, 1, 0], dtype=int64)
accuracy_rf = accuracy_score(y_test, y_pred_rf)
print(round(accuracy_rf, 2))
0.97
print(classification_report(y_test, y_pred_rf))
precision recall f1-score support
0 0.97 1.00 0.98 18318
1 0.91 0.57 0.70 1287
accuracy 0.97 19605
macro avg 0.94 0.78 0.84 19605
weighted avg 0.97 0.97 0.96 19605
conf_log = confusion_matrix(y_test, y_pred_log)
sns.heatmap(conf_log, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual');
3. Testing with Gradient Boosting Classifier¶
from sklearn.ensemble import GradientBoostingClassifier
gb_model = GradientBoostingClassifier(random_state=42)
gb_model.fit(X_train, y_train)
GradientBoostingClassifier(random_state=42)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
GradientBoostingClassifier(random_state=42)
y_pred_gb = gb_model.predict(X_test)
y_pred_gb
array([0, 0, 0, ..., 0, 1, 0], dtype=int64)
gb_accuracy = accuracy_score(y_test, y_pred_gb)
print(round(gb_accuracy, 2))
0.97
print(classification_report(y_test, y_pred_gb))
precision recall f1-score support
0 0.97 1.00 0.98 18318
1 0.98 0.56 0.71 1287
accuracy 0.97 19605
macro avg 0.97 0.78 0.85 19605
weighted avg 0.97 0.97 0.97 19605
conf_log = confusion_matrix(y_test, y_pred_log)
sns.heatmap(conf_log, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual');
Trying Out Other Models¶
## Train and Evaluate Models
for model_name, model in models.items():
# Train the model
model.fit(X_train, y_train)
# Make predictions on the test set
y_pred = model.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred) * 100
report = classification_report(y_test, y_pred)
matrix = confusion_matrix(y_test, y_pred)
# Display Results
print(f"{model_name} \n")
print(f"Accuracy: {accuracy:.2f}%")
print("Classification Report: \n", report)
print("Confusion Matrix: \n", matrix)
plt.figure(figsize=(4, 2))
sns.heatmap(matrix, annot=True, cmap="YlGnBu", fmt='g')
plt.show()
Logistic Regression
Accuracy: 95.96%
Classification Report:
precision recall f1-score support
0 0.97 0.99 0.98 18318
1 0.81 0.50 0.62 1287
accuracy 0.96 19605
macro avg 0.89 0.75 0.80 19605
weighted avg 0.96 0.96 0.96 19605
Confusion Matrix:
[[18168 150]
[ 642 645]]
Decision Tree
Accuracy: 94.91%
Classification Report:
precision recall f1-score support
0 0.98 0.97 0.97 18318
1 0.60 0.65 0.63 1287
accuracy 0.95 19605
macro avg 0.79 0.81 0.80 19605
weighted avg 0.95 0.95 0.95 19605
Confusion Matrix:
[[17766 552]
[ 446 841]]
Random Forest
Accuracy: 96.83%
Classification Report:
precision recall f1-score support
0 0.97 1.00 0.98 18318
1 0.91 0.57 0.70 1287
accuracy 0.97 19605
macro avg 0.94 0.78 0.84 19605
weighted avg 0.97 0.97 0.96 19605
Confusion Matrix:
[[18249 69]
[ 553 734]]
Gradient Boosting
Accuracy: 97.03%
Classification Report:
precision recall f1-score support
0 0.97 1.00 0.98 18318
1 0.98 0.56 0.71 1287
accuracy 0.97 19605
macro avg 0.97 0.78 0.85 19605
weighted avg 0.97 0.97 0.97 19605
Confusion Matrix:
[[18300 18]
[ 564 723]]
Support Vector Classifier
Accuracy: 96.10%
Classification Report:
precision recall f1-score support
0 0.96 1.00 0.98 18318
1 0.94 0.43 0.59 1287
accuracy 0.96 19605
macro avg 0.95 0.72 0.79 19605
weighted avg 0.96 0.96 0.95 19605
Confusion Matrix:
[[18285 33]
[ 731 556]]
Evaluating the Different Models for Selection¶
# Evaluation
## Define the parameter grid for Logistic Regression
log_reg_param_grid = {
'C': np.logspace(-4, 4, 10),
'solver': ['liblinear', 'saga'],
'penalty': ['l1', 'l2'],
'max_iter': [100, 200, 500]
}
## Initialize Logistic Regression model
log_reg = LogisticRegression()
# Randomized search for Logistic Regression
log_reg_random_search = RandomizedSearchCV(estimator=log_reg,
param_distributions=log_reg_param_grid,
n_iter=10, # Number of random combinations to try
scoring='roc_auc',
cv=5, # 5-fold cross-validation
verbose=2, # Show progress
random_state=42,
n_jobs=-1) # Use all available cores
## Fit model
log_reg_random_search.fit(X_train, y_train)
best_log_reg = log_reg_random_search.best_estimator_
Fitting 5 folds for each of 10 candidates, totalling 50 fits
## Define the parameter grid for Random Forest
rf_param_grid = {
'n_estimators': [100, 200, 500],
'max_features': ['auto', 'sqrt'],
'max_depth': [10, 20, 30, None],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4],
'bootstrap': [True, False]
}
## Initialize Random Forest model
rf = RandomForestClassifier()
## Randomized search for Random Forest
rf_random_search = RandomizedSearchCV(estimator=rf,
param_distributions=rf_param_grid,
n_iter=10, # Number of random combinations to try
scoring='roc_auc',
cv=5, # 5-fold cross-validation
verbose=2, # Show progress
random_state=42,
n_jobs=-1) # Use all available cores
## Fit model
rf_random_search.fit(X_train, y_train)
best_rf = rf_random_search.best_estimator_
Fitting 5 folds for each of 10 candidates, totalling 50 fits
## Define the parameter grid for Gradient Boosting
gb_param_grid = {
'n_estimators': [100, 200, 500],
'max_features': ['auto', 'sqrt'],
'max_depth': [10, 20, 30, None],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4],
'bootstrap': [True, False]
}
## Initialize Gradient Boosting model
gb = GradientBoostingClassifier()
## Randomized search for Gradient Boosting
gb_random_search = RandomizedSearchCV(estimator=rf,
param_distributions=rf_param_grid,
n_iter=10, # Number of random combinations to try
scoring='roc_auc',
cv=5, # 5-fold cross-validation
verbose=2, # Show progress
random_state=42,
n_jobs=-1) # Use all available cores
## Fit model
gb_random_search.fit(X_train, y_train)
best_gb = gb_random_search.best_estimator_
Fitting 5 folds for each of 10 candidates, totalling 50 fits
Plotting the ROC Curve for each Model¶
# Calculate FPR, TPR, and threshold values
fpr_rf, tpr_rf, thresholds_rf = roc_curve(y_test, y_pred)
import matplotlib.pyplot as plt
plt.figure(figsize=(8, 6))
plt.plot(fpr_rf, tpr_rf, label="Random Forest")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.legend()
plt.show()
# Calculate FPR, TPR, and threshold values
fpr_log, tpr_log, thresholds_log = roc_curve(y_test, y_pred)
import matplotlib.pyplot as plt
plt.figure(figsize=(8, 6))
plt.plot(fpr_rf, tpr_rf, label="Logistic Regression")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.legend()
plt.show()
# Calculate FPR, TPR, and threshold values
fpr_gb, tpr_gb, thresholds_gb = roc_curve(y_test, y_pred)
import matplotlib.pyplot as plt
plt.figure(figsize=(8, 6))
plt.plot(fpr_gb, tpr_gb, label="Gradient Boosting")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.legend()
plt.show()
Visualizing All the Models Curves Together¶
from sklearn.metrics import roc_curve, auc
# Calculate fpr, tpr, and AUC for Logistic Regression, Gradient Boosting and Random Forest
fpr_log, tpr_log, _ = roc_curve(y_test, y_pred_log) # Logistic Regression predictions
auc_log = auc(fpr_log, tpr_log)
fpr_rf, tpr_rf, _ = roc_curve(y_test, y_pred_rf) # Random Forest predictions
auc_rf = auc(fpr_rf, tpr_rf)
fpr_gb, tpr_gb, _ = roc_curve(y_test, y_pred_gb) # Gradiet boosting predictions
auc_gb = auc(fpr_gb, tpr_gb)
# Plotting
plt.figure(figsize=(8, 6))
# Logistic Regression ROC curve
plt.plot(fpr_log, tpr_log, color='blue', label=f'Logistic Regression (AUC = {auc_log:.2f})')
# Random Forest ROC curve
plt.plot(fpr_rf, tpr_rf, color='green', label=f'Random Forest (AUC = {auc_rf:.2f})')
# Gradiet Boosting ROC curve
plt.plot(fpr_gb, tpr_gb, color='purple', label=f'Gradiet Boosting (AUC = {auc_gb:.2f})')
# Reference line for random guessing
plt.plot([0, 1], [0, 1], color='red', linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve - Gradient Boosting vs Logistic Regression & Random Forest')
plt.legend(loc='lower right')
plt.show()
Model Comparison: AUC Scores¶
- 1. Logistic Regression (AUC = 0.75):
- Logistic Regression achieves a moderate AUC of 0.75, indicating its capability to reasonably distinguish between positive (diabetic) and negative (non-diabetic) cases.
- Strengths:
- Simplicity and interpretability.
- Performs well with linear relationships between features and the target.
- Weaknesses:
- Struggles with non-linear relationships and complex feature interactions.
- 2. Random Forest (AUC = 0.78):
- Random Forest shows better performance (AUC = 0.78), likely due to its ability to handle non-linear relationships and interactions between features.
- Strengths:
- Robust to overfitting due to ensemble averaging.
- Handles high-dimensional data well.
- Weaknesses:
- Less interpretable than Logistic Regression.
- 3. Gradient Boosting (AUC = 0.78):
- Gradient Boosting matches Random Forest in terms of AUC, suggesting comparable ability in distinguishing positive and negative cases.
- Strengths:
- Boosting focuses on hard-to-predict cases, potentially improving overall accuracy.
- Often outperforms Random Forest when hyperparameters are finely tuned.
- Weaknesses:
- Computationally intensive.
- Sensitive to overfitting without proper tuning (e.g., learning rate and tree depth).
Cross-Validation of the 3 Models¶
## Cross-Validation for Logistic Regression
log_reg_auc = cross_val_score(best_log_reg, X_train, y_train, cv=5, scoring='roc_auc')
print("Logistic Regression AUC-ROC: ", np.mean(log_reg_auc))
## Cross-Validation for Gradient Boosting
gb_reg_auc = cross_val_score(best_gb, X_train, y_train, cv=5, scoring='roc_auc')
print("Gradient Boosting AUC-ROC: ", np.mean(gb_reg_auc))
Logistic Regression AUC-ROC: 0.951182664830219 Gradient Boosting AUC-ROC: 0.9632515843570179
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
# Assuming y_train contains binary class labels (0 and 1)
best_rf_clf = RandomForestClassifier()
rf_clf_auc = cross_val_score(best_rf_clf, X_train, y_train, cv=5, scoring='roc_auc')
print("Cross-validated ROC AUC:", rf_clf_auc.mean())
Cross-validated ROC AUC: 0.9459222845491043
Model Comparison: Cross-Validation AUC-ROC Scores¶
- Cross-validation AUC-ROC scores reveal the models' ability to generalize across different data splits:
- Logistic Regression (0.951):
- The high cross-validation AUC-ROC indicates strong discrimination ability across various datasets, despite a lower AUC on the test data.
- Suggests that the model is well-calibrated but limited in handling non-linear interactions, which may reduce its test performance.
- Random Forest (0.963):
- Random Forest shows slightly better generalization than Logistic Regression, reinforcing its robustness and ability to capture complex relationships.
- Gradient Boosting (0.963):
- Gradient Boosting marginally outperforms Random Forest in cross-validation AUC-ROC, suggesting it may generalize slightly better across unseen data when optimized correctly.
Insights¶
- 1. Gradient Boosting and Random Forest Are Superior:
- Both models achieve higher AUC scores and cross-validation AUC-ROC values than Logistic Regression, making them better choices for this dataset.
- The comparable performance of Random Forest and Gradient Boosting suggests that either model can be selected based on other factors like computational cost, interpretability, or ease of tuning.
- 2. Logistic Regression Is Still Valuable:
- Despite lower AUC scores, Logistic Regression remains valuable for its simplicity, computational efficiency, and interpretability.
- It may be the preferred choice if model transparency is critical.
- 3. Gradient Boosting Slightly Edges Out Random Forest:
- Gradient Boosting’s marginally better cross-validation AUC-ROC indicates its ability to refine predictions through iterative learning, especially when hyperparameters are finely tuned.
- 4. Room for Improvement:
- Fine-tune hyperparameters for Gradient Boosting (e.g., learning rate, tree depth) to potentially push its performance further above Random Forest.
- Explore ensemble methods like stacking to combine the strengths of all three models for a higher overall AUC.
Recommendation¶
- Primary Model: Gradient Boosting for its slightly better generalization and potential for improved predictions with hyperparameter tuning.
- Alternative Model: Random Forest for comparable performance, faster training, and robustness.
- Fallback Option: Logistic Regression if computational efficiency or interpretability is prioritized.
Calibration Curve for Probability Outputs¶
- Calibration curves evaluate how well the predicted probabilities match the true outcomes.
from sklearn.calibration import calibration_curve
# Calibration for Logistic Regression
prob_true_log, prob_pred_log = calibration_curve(y_test, y_pred_log, n_bins=10)
# Calibration for Random Forest
prob_true_rf, prob_pred_rf = calibration_curve(y_test, y_pred_rf, n_bins=10)
# Calibration for Gradient Boosting
prob_true_gb, prob_pred_gb = calibration_curve(y_test, y_pred_gb, n_bins=10)
# Plot calibration curves
plt.figure(figsize=(10, 6))
plt.plot(prob_pred_log, prob_true_log, marker='o', label='Logistic Regression', color='blue')
plt.plot(prob_pred_rf, prob_true_rf, marker='o', label='Random Forest', color='green')
plt.plot(prob_pred_gb, prob_true_gb, marker='o', label='Gradient Boosting', color='Orange')
plt.plot([0, 1], [0, 1], linestyle='--', color='red', label='Perfect Calibration')
plt.xlabel("Mean Predicted Probability")
plt.ylabel("Fraction of Positives")
plt.title("Calibration Curve")
plt.legend(loc="upper left")
plt.show()
1. Interpretation of Models¶
Logistic Regression (Blue):
- The curve is the furthest from the perfect calibration line (dashed red line).
- Indicates under confidence: The predicted probabilities tend to underestimate the true likelihood of the positive class.
Random Forest (Green):
- Closer to the perfect calibration line than logistic regression.
- Indicates better-calibrated probabilities but still slightly deviates, showing mild overconfidence at higher probability levels.
Gradient Boosting (Orange):
- Closest to the perfect calibration line, indicating the most reliable probability estimates among the three models.
- Excellent performance in producing probabilities that align well with actual observed frequencies.
2. Performance Insights¶
Gradient Boosting:
- This model is the most calibrated, making it the best choice for scenarios requiring precise probabilistic estimates, such as risk predictions or decision-making thresholds.
Random Forest:
- Performs relatively well but may require post-calibration (e.g., Platt scaling or isotonic regression) to improve the alignment of probabilities.
Logistic Regression:
- While simple and interpretable, its calibration could be significantly improved, especially for datasets with complex relationships.
3. Application Considerations¶
- If precise probabilities are critical (e.g., in healthcare for predicting patient risk or attrition), Gradient Boosting is the most reliable option.
- Random Forest may be used if slightly less precise probabilities are acceptable, but post-calibration can enhance its utility.
- Logistic Regression might be favored for its simplicity and explainability but may require recalibration techniques.
Recommendations¶
- Use Gradient Boosting for applications requiring highly reliable probability scores.
- Apply calibration techniques (e.g., isotonic regression) to Random Forest and Logistic Regression if these models are preferred for other reasons (e.g., interpretability or computation speed).
- Validate these results on additional datasets to ensure the calibration curves generalize well.
Further Evaluation of Gradient Boosting Model¶
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import precision_recall_curve, f1_score, classification_report, accuracy_score
import matplotlib.pyplot as plt
# Load the dataset (replace with your dataset path)
df2 = pd.read_csv(r"C:\Users\austi\Downloads\diabetes_prediction_dataset (1).csv")
# --- Data Preprocessing ---
# Encode categorical features
df2['gender_encoded'] = df2['gender'].map({'Female': 0, 'Male': 1})
smoking_dummies = pd.get_dummies(df2['smoking_history'], prefix='smoking')
df2_encoded = pd.concat([df2, smoking_dummies], axis=1)
df2_encoded.drop(['gender', 'smoking_history'], axis=1, inplace=True)
# Create interaction term
df2_encoded['HbA1c_glucose_interaction'] = (
df2_encoded['HbA1c_level'] * df2_encoded['blood_glucose_level']
)
# Check for missing values and impute
df2_encoded.fillna(df2_encoded.median(), inplace=True)
# Define features and target
X = df2_encoded.drop('diabetes', axis=1)
y = df2_encoded['diabetes']
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Replace NaN or infinite values in scaled data
X_train_scaled = np.nan_to_num(X_train_scaled, nan=np.nanmean(X_train_scaled), posinf=np.nanmean(X_train_scaled), neginf=np.nanmean(X_train_scaled))
X_test_scaled = np.nan_to_num(X_test_scaled, nan=np.nanmean(X_test_scaled), posinf=np.nanmean(X_test_scaled), neginf=np.nanmean(X_test_scaled))
# --- Gradient Boosting Model Training ---
gb_model = GradientBoostingClassifier(random_state=42, n_estimators=200, learning_rate=0.1, max_depth=5)
gb_model.fit(X_train_scaled, y_train)
# Predict probabilities
y_proba_gb = gb_model.predict_proba(X_test_scaled)[:, 1]
# --- Threshold Analysis ---
# Precision-Recall curve
precision, recall, thresholds = precision_recall_curve(y_test, y_proba_gb)
f1_scores = 2 * (precision * recall) / (precision + recall)
# Identify the optimal threshold
optimal_idx = np.argmax(f1_scores)
optimal_threshold = thresholds[optimal_idx]
# Metrics at optimal threshold
print(f"Optimal Threshold: {optimal_threshold}")
print(f"Precision: {precision[optimal_idx]:.4f}")
print(f"Recall: {recall[optimal_idx]:.4f}")
print(f"F1-Score: {f1_scores[optimal_idx]:.4f}")
# Predict using the optimal threshold
y_pred_optimal = (y_proba_gb >= optimal_threshold).astype(int)
# Evaluation Metrics
print("\nClassification Report (Optimal Threshold):")
print(classification_report(y_test, y_pred_optimal))
# --- Plot Precision-Recall Curve ---
plt.figure(figsize=(8, 6))
plt.plot(recall, precision, label="Precision-Recall Curve")
plt.scatter(recall[optimal_idx], precision[optimal_idx], color='red', label="Optimal Point")
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-Recall Curve (Gradient Boosting)")
plt.legend()
plt.grid()
plt.show()
# --- Additional Evaluation ---
# Accuracy
accuracy_optimal = accuracy_score(y_test, y_pred_optimal)
print(f"Accuracy at Optimal Threshold: {accuracy_optimal:.4f}")
Optimal Threshold: 0.9999979408643517
Precision: 0.0000
Recall: 0.0000
F1-Score: nan
Classification Report (Optimal Threshold):
precision recall f1-score support
0 0.91 1.00 0.96 18300
1 0.00 0.00 0.00 1700
accuracy 0.91 20000
macro avg 0.46 0.50 0.48 20000
weighted avg 0.84 0.91 0.87 20000
Accuracy at Optimal Threshold: 0.9150
import matplotlib.pyplot as plt
# Plot Precision-Recall Curve
plt.figure(figsize=(10, 6))
plt.plot(recall, precision, label='Precision-Recall Curve')
plt.title('Precision-Recall Curve for Gradient Boosting')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.axvline(recall[optimal_idx], color='red', linestyle='--', label=f'Optimal Recall: {recall[optimal_idx]:.2f}')
plt.axhline(precision[optimal_idx], color='blue', linestyle='--', label=f'Optimal Precision: {precision[optimal_idx]:.2f}')
plt.scatter(recall[optimal_idx], precision[optimal_idx], color='red', label='Optimal Threshold', zorder=5)
plt.legend()
plt.grid()
plt.show()
# Plot F1-Score vs Thresholds
plt.figure(figsize=(10, 6))
plt.plot(thresholds, f1_scores[:-1], label='F1-Score Curve')
plt.title('F1-Score vs Thresholds')
plt.xlabel('Threshold')
plt.ylabel('F1-Score')
plt.axvline(optimal_threshold, color='red', linestyle='--', label=f'Optimal Threshold: {optimal_threshold:.2f}')
plt.legend()
plt.grid()
plt.show()
# Plot Recall vs Thresholds
plt.figure(figsize=(10, 6))
plt.plot(thresholds, recall[:-1], label='Recall Curve', color='orange')
plt.title('Recall vs Thresholds')
plt.xlabel('Threshold')
plt.ylabel('Recall')
plt.axvline(optimal_threshold, color='red', linestyle='--', label=f'Optimal Threshold: {optimal_threshold:.2f}')
plt.legend()
plt.grid()
plt.show()
# Plot Precision vs Thresholds
plt.figure(figsize=(10, 6))
plt.plot(thresholds, precision[:-1], label='Precision Curve', color='green')
plt.title('Precision vs Thresholds')
plt.xlabel('Threshold')
plt.ylabel('Precision')
plt.axvline(optimal_threshold, color='red', linestyle='--', label=f'Optimal Threshold: {optimal_threshold:.2f}')
plt.legend()
plt.grid()
plt.show()
# --- Additional Evaluation ---
# Accuracy
accuracy_optimal = accuracy_score(y_test, y_pred_optimal)
print(f"Accuracy at Optimal Threshold: {accuracy_optimal:.4f}")
Accuracy at Optimal Threshold: 0.9150
Key Insights Across All Plots:¶
1. Model Performance:¶
- The model demonstrates strong predictive capabilities, with high precision-recall balance and a peak F1-score suggesting effective trade-offs.
2. Threshold Optimization:¶
- Selecting an optimal threshold is critical and should align with the business context:
- For high-stakes decisions (e.g., healthcare), prioritize recall to minimize false negatives.
- For cost-sensitive applications, prioritize precision to reduce false positives.
3. Gradient Boosting Strength:¶
- These curves show that Gradient Boosting is well-calibrated for both high precision and recall.
Recommendations:¶
1. Threshold Selection:¶
- Choose the threshold corresponding to the highest F1-score for general-purpose optimization.
- Alternatively, tailor the threshold to business-specific goals, balancing recall and precision appropriately.
2. Additional Validation:¶
- Evaluate performance on a holdout test set or through cross-validation to ensure consistency.
3. Calibration:¶
- If the Precision-Recall Curve deviates from expectations, consider recalibrating the model to align with domain-specific needs.
Conclusion¶
- Use Gradient Boosting for applications requiring highly reliable probability scores.
- Apply calibration techniques (e.g., isotonic regression) to Random Forest and Logistic Regression if these models are preferred for other reasons (e.g., interpretability or computation speed).
- Validate these results on additional datasets to ensure the calibration curves generalize well.